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Abstract 

In this note, we shortly survey some recent approaches on the approx- 
imation of the Bayes factor used in Bayesian hypothesis testing and in 
Bayesian model choice. In particular, we reassess importance sampling, 
harmonic mean sampling, and nested sampling from a unified perspective. 
Keywords: Bayesian inference, Bayes factor, importance sampling, nested 
sampling, Monte Carlo 



1 Introduction 

The Bayes factor is a fundamental procedure that stands at the core of the 
Bayesian theory of testing hypotheses, at least in the approach advocated by 



Jeffreys 



Jeffreys 



(1939). (Robert et al. (20091 provides a reassessment of the role of 



( 1939 ) in setting a formal framework for Bayesian testing) and by Jaynes 



(2003 1. Given an hypothesis H : 6 £ Q °n the parameter 9 g 6 of a statistical 



model, with density f(x\0), under a compatible prior of the form 

7r(e o )7r o (0)+ 71(6^(0), 
the Bayes factor is defined as the posterior odds to prior odds ratio, namely 

ir(e \x) /7r(0o) 



Boi(x) 



'e / Je^ 

Since model choice can be considered from a similar perspective, under the 



Bayesian paradigm (see, e.g., Robert (2001 1), the comparison of models 
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where the family 3 can be finite or infinite, leads to the same quantities, 



Pi fi(z\9i)iri(9i)d8i / Pj /,(■'■ (',)-,(() , U\(), . i,ie3 



In this short survey, we consider some of the most common Monte Carlo solu- 
tions used to approximate a generic Bayes factor or its fundamental component, 
the evidence 



3k 



n k (6 k )L k (6 k )d6 k , 



aka the marginal likelihood. Longer entries can be found in |Carlin and Chib 



(1995 



(2008 



Chen et al. (20001, Robert and Casella (2004), or Friel and Pettitt 



Note that we do not mention here trans-dimensional methods issued 



from the revolutionary paper of Green ( 1995 ) , since our goal here is to demon- 



strate that within-model simulation allows for the computation of Bayes factors 
and thus avoids the additional complexity involved in trans-dimensional meth- 
ods. 



2 Importance sampling solutions 

While a regular importance sampling approach is feasible towards the approxi- 
mation of the Bayes factor 



/i(;r|0i)7ri(0i)d0i 
f2(x\e 2 )7r 2 (0 2 )de 2 



t>\ 2 — —jr- 



as for instance in 



B\ 2 



H x Y,Zxh{x\e 2 )^ 2 {ei)/w 2 {ei) 



which relies on importance functions (densities) W\ and w 2 and on simulations 
6\ ~ vji and 0\ ~ ru 2 , specific solutions targeted toward Bayesian model choice 
are indeed available and preferable. Most of those solutions fit under the de- 
nomination of bridge sampling and aim at taking advantage of the connections 
between the two models under comparison. In fact, when comparing two models 
with the same complexity (i.e., the same dimension for their respective param- 
eter spaces), it is often possible to find a reparamcterisation of both models 
in terms of some specific moments of the sampling model, like E[X], so that 
parameters under both models have a common meaning. 



2.1 Bridge sampling 

Assuming that the parameters of both models under comparison, 9\ and 2 
respectively, thus belong to the same parameter space (i.e., 0i = & 2 ), a first 
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solution is to syndicate simulations between both models in order (a) to recycle 
simulations under one model for the other model and (b) to create correlation 
between the estimates of the numerator and of the denominator of the Bayes 
factor in order to improve stability in the estimate. This solution is made clear 



with the formula of |Gelman and Meng (19981 



B12 



1 >A tt 1 (9 2i \x) 
n^ i n 2 (e 2l \x) 



when #2i ~ i"2(0|#)j where 



7Ti(0i|a;) cx 7iri(0i|x) 

7T 2 (0 2 \x) CX TT 2 (9 2 \x) , 

as in most Bayesian settings. (An extension to the cases when Oj C 02, in- 
cluding those when the dimension of 0i is smaller than the dimension of 02, 



can be easily derived, as shown in Chen et al. (2000). Note that the assumption 



©l = 02 signifies that the representations of both models have been reparame- 
terised in terms of the same moments.) 



This is a very special case of the general representation (Torrie and Valleau 



1977) 



E y [7Tl(fl)Mfl)] 

12 e v [Me) Me)] ' 

which holds for any density ip with a sufficiently large support and requires a 
single sample 6\ , . . . , 6 n generated from ip to produce an importance sampling 
ratio estimate. In that case, a quasi-optimal solution is provided by |Chen et al. 



(20001, namely <p*(0) oc | Tti(9) — tt 2 (0) |. The missing normalising constants in 
both tti and ir 2 obviously mean that this solution cannot be used per se. In 
fact, considering the very special case when tti(0) — tt 2 (0) on some region of 
the parameter space, we see that the solution <p*(0) should not be used because 
it is null on some portion of the support of w± and tt 2 , thus contradicting a 
fundamental requirement of importance sampling. 

Another extension of this bridge sampling approach can be based on the 
general representation 



B 



12 



TT2(0\x)a(0)w 1 (0\x)d0 
^ 1 (0\x)a(0)Tr 2 (0\x)d0 

EW2(0ii\x)a(0 li ) 



ti i 



i=l 



-t '1-2 

-E 



no 

ni{02i\x)a{02i 



where 0ji ~ Kj{0\x), which applies for any positive and integrable function 
a. Some choices of a do lead to very poor performances of the method in 
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connection with the harmonic mean approach (Raftery et al. 2007[), but there 
exists a quasi-optimal solution, as provided by Gelman and Meng ( 1998 ): 



1 

or oc 

n 1 TTi(9\x) + ri2ir 2 (0\x) 

Once again, the optimum cannot be used per se, since it requires the normalis- 



ing constants of both 7i"i and 7r 2 . As suggested by Gelman and Meng (1998), an 



approximate version uses iterative versions of a*, based on successive iterates 
of approximations to the Bayes factor. Note that this solution recycles simula- 
tions from each posterior, which is quite appropriate since one model is selected 
via the Bayes factor, instead of using an importance sample common to both 
approximations. We will see below an alternative representation of the bridge 
factor that bypasses this difficulty 



2.2 Harmonic means 

While using the generic harmonic mean approximation to the marginal likeli- 



hood is often fraught with danger (Neal 1994 Chopin and Robert 2007a I, the 
representation ( Gelfand and Dey||1994 l 

<p(0 k ) 1 f <p{B k ) 7r k (9 k )L k (9 k ) 1 

d9 k = — 



E 71 



n k {0k)L k (6 k ) 



K k {0 k )L k {6 k ) 



(1) 



holds, no matter what the density ip(-) is. This representation is remarkable 
in that it allows for a direct processing of Monte Carlo or MCMC output from 
the posterior distribution. In addition, and as opposed to usual importance 
sampling constraints, the density ip(9) must have lighter — rather than fatter — 
tails than Tt k (9 k )L k (9 k ) for the approximation of the Bayes factor 




to enjoy finite variance. Therefore, using ip(6 k ) = Tr k (9 k ) as in the original har- 
monic mean approximation (Newton and Raftery 1994J will often result in an 
infinite variance, as discussed by (Neal 1994). On the opposite, using <p's with 
constrained supports derived from a Monte Carlo sample, like the convex hull 
of the simulations corresponding to the 10% or to the 25% HPD regions — that 
again is easily derived from the simulations — is both completely appropriate and 
implement able, as illustrated by Figure [T] for a toy example. In this example, 
we used the simulations within the HPD region to define an ellipse and conse- 
quently a uniform density ip over this ellipse. Since the true "evidence" can be 
computed analytically, checking the convergence of the harmonic approximation 
is straightforward. (We warn the reader that this "evidence" cannot be used 
in a model comparison framework, because it is associated with an improper 



prior that is not acceptable in testing settings. See DeGroot (19731 or Robert 



(20011 for more details. Nonetheless, it provides a valid toy example to check 



the convergence of an integral approximation.) 
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Figure 1: (left) Representation of a Gibbs sample of 10 3 parameters (8, a 2 ) for 
the normal model, x\, . . . , x n ~ Af(0, a 2 ) with x = 0, s 2 = 1 and n = 10, under 
Jeffreys' prior, along with the pointwise approximation to the 10% HPD region 
(in darker hues), (right) Evaluation of the approximation of the evidence based 
on the density <p for the uniform distribution on the ellipse approximating this 
HPD region. 



2.3 Mixture bridge sampling 

As noted above, a remarkable feature of the representation (JlJ is that the derived 
implementation can directly exploit the output of any MCMC sampler. Another 



approach introduced in Chopin and Robert (2007b) aims at the same goal and 



attains the optimal bridge sampler from a completely different perspective. It 
considers a specific mixture structure as importance function, of the form 

tp(6) cx luit(8)L(8) + <p(8) , 

where ip() is an arbitrary but fully normalised density. Simulating from this 
mixture, assuming there already exists an MCMC sampler with stationary dis- 
tribution ir(8\x) cx ir(8)L(8), is straightforward, thanks to a tailored Gibbs 
sampler: 

Mixture Gibbs sampler 
At iteration t 

1. Take 5^ = 1 with probability 
and <JW — 2 otherwise; 
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2. If 5W = l, generate Of ~ MCMC(^* ^flfc) where MCMC(8 k ,8' k ) de- 
notes an arbitrary MCMC kernel associated with the posterior 7Tfc(#fc|a:) oc 

7i'fc(ffc)ife(6 , fe); 

3. If 5® = 2, generate ~ y(0fe) independently 

The simulation step 1 . selecting between both components of the mixture is 
not only allowing for simulation from this mixture, but it also does provide a 
direct estimate to the evidence. Indeed, the Rao-Blackwellised estimate 

T 

converges to Wi3fc/{wi3fc + 1} and we can thus deduce 
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We have thus recovered the optimal bridge sampling estimate from this different 



perspective, akin to Bartolucci et al. (2006), that is more in line with reversible 



jump trans-dimensional schemes than regular importance sampling. The only 
modification compared with the original version is that the sequence (8 k ) is 
generated from the mixture for both the numerator and the denominator. Figure 
[2] shows that, for the toy example introduced in Figure [T] the harmonic mean 
approximation does as well as the optimal bridge sampling solution. 



3 Nested sampling 



This method introduc ed in[Sk illing (2006, 2007) (although an earlier version can 
be found in Burrows ( 1980 1) produces a very specific type of importance sam- 
pling based on constrained simulations from the prior distribution. While more 
details descriptions are available in the above reference and|Chopin and Robert| 



( 2007b I (as well as a full convergence assessment in the later paper) , let us recall 
here that nested sampling is based on the one-dimensional representation 



3 = E*[L(0)] 



<p(x) dx 



of the evidence, when 



V-\1)=P*{L{8)>1) 



is the survival probability function associated with the likelihood. The approx- 
imation of 3 by a Riemann sum: 



N 



(Xi-i - Xi)(f(Xi) 



i=l 



6 




~i 1 r 

Oe+OO 2e+04 4e+04 6e+04 8e+04 




Figure 2: For the same setting as Figure [I] (left) Evaluation of the bridge sam- 
pling approximation of the evidence based on the mixture if when the density (p 
corresponds to the uniform distribution on the ellipse approximating the same 
HPD region as in Figure [TJ using a value of to equal to one-tenth of the true 
evidence; (right) boxplot comparison of the variations of both approaches based 
on 100 Monte Carlo replicas with samples of size 10 4 . On both graphs, the true 
evidence is represented by the horizontal dotted line. 



where the cither deterministic, e.g. Xi = exp{— i/N}, or random, also 

allows for the representation 

N-l 

3=J^ {(p(x i+1 ) - ip.(xi)}xi 

i=0 

which is a special case of 

3 = ]T {L(% +1 )) - l(6) > £(%))}) (2) 

2=0 

where • • • L(9^ + i^) > L(9^) ■ ■ ■ . (This can be seen as the Lebesgue version of 
the Riemann's sum, the triangulation being on the second axis instead of the first 
axis.) Since ip is rarely available in closed form, the nested sampling algorithm 
relies on an estimate of ip(xi) or, equivalently, of tt({8; L(9) > L(6^)}): 

Nested sampling algorithm 

Start with N values 6\, . . . , Off sampled from ir 

At iteration i, 

1. Take ipi = L(9k), where 9k is the point with smallest likelihood in the pool 
of 9i's 
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2. Replace 9k with a sample from the prior constrained to L(9) > <pf the current 
N points are sampled from prior constrained to L(9) > ipi. 

In terms of the representation (J2|, this amounts to use the approximation 
£({0; L(9) > L(0 w )})/tt({0; L(9) > = (N - l)/N . 



As discussed in Evans| ( |20Q7| ) and Chopin and Robert (2007c I, the dominat 



ing term in the approximation is the stochastic part that converges at 0(^/n) 
speed. The method thus formally compares with those mentioned in the pre- 
vious section. At a more practical level, nested sampling can be interpreted as 
an importance sampling technique where, instead of simulating a whole sample 
from the prior distribution, tt(9), and approximating the evidence by 



1 N 



which is usually inefficient, points 9^ are successively sampled from the prior 
restricted to higher and higher levels of the likelihood, with decreasing weights 



N \ N 



To illustrate the comparison with a standard importance sampling approx- 
imation, we now consider an artificial example based on the likelihood of a 
twisted normal distribution (first introduced in ( Haario et al.|1999 1 as a bench- 



mark for adaptive MCMC schemes) in two dimensions with covariance ma- 
trix E = diag(crf,l). The "twist" is due to considering the transform 9' 2 = 
02 + (3(0 1 — erf), which leads to a sharp bend in the likelihood contours, as 
shown in Figure [3j Since the Jacobian of the twist is equal to 1, the density is 
thus defined as: 

*a) - 2 + " *?)) ~ AA 2 (0, E) . 

If we consider (3 and a\ as known, the appeal of this example is in integrating 
over the parameters 9\ and 02 with priors tt^i,^)- The toy evidence can thus 
be represented as 

3i= J ^(9 1 ,9 2 )7T(9 1 ,9 2 )d9 

For the example that we consider, we fix (3 = 0.03, a\ — 100 (as represented in 
Figure [3]) and we use flat priors on 9\ in (—40,40) and on #2 also on (—40,40). 
(The prior square is chosen arbitrarily to allow all possible values and still 
to retain a compact parameter space. Furthermore, a flat prior allows for an 
easy implementation of nested sampling since the constrained simulation can 



be implemented via a random walk move, as pointed out in |Skilling| ( 2006 ).). 
Integrating the likelihood over this region of the parameter space presents a 
challenging problem for any approach as the coverage of the tails of the twisted 
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Figure 3: Contours (at the levels 68%, 95% and 99.9%) of the likelihood of the 
twisted normal model for (3 = .03, a\ = (100). 



normal distribution can be difficult or even impossible to capture. (This point 



is discussed at length in Wraith et al. (20091.) 

In this toy example, the two-dimensional nature of the parameter space does 
allow for a numerical integration of 3i, thus producing a reference value, based 
on a Riemann approximation of the integral and a grid of 1000x1000 points in 
the (-40, -40) x (40,40) square (an adaptive quadrature approach was also used 
as a check). This approach leads to a stable evaluation of 3i that can be reliably 
taken as the reference against which we test alternative approximation methods. 

The comparison is here restricted to a standard nested sampling algorithm 



derived from Skilling (20061 and a population Monte Carlo (PMC) mixture 
importance sampler constructed in Wraith et al. (20091 for this benchmark 



problem and introduced in Cappe et al. (20081 in a general framework. Briefly 



this importance sampling approach is adaptive as it consists of modifying the 
parameters of the importance function (a mixture density), bringing it closer 
to the posterior density over a small number of iterations. The proximity is 
measured in terms of the Kullback divergence between the posterior density 
and importance function, since utilising an integrated EM approach ensures 
that the divergence successively decreases at each iteration. While this adaptive 
importance sampler derives a proposal ip to simulate from the (pseudo-)posterior 
tp(0i, 82)^(01, 82)1 it can obviously provide in addition an approximation of the 
marginal likelihood 3i- (We stress that any importance sampler used in this 
setting offers this facility of providing an approximation of both the evidence 
and of its variability.) 

For comparison purposes, the PMC approach uses 5000 simulated points 
per iteration over a total of 10 iterations and then a final sample of 50000 
points, simulated from the "optimal" importance sampling function obtained 
through the PMC sequence. The importance function to be optimised consists 
of a mixture of 9 multivariate Student t's with 9 degrees of freedom for each 
component. For the initial values of the importance function, components of 
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i : m.; 



Figure 4: Comparison of nested sampling with PMC over 100 simulation runs 
for (left) Evidence estimation; (centre) E[9i] estimation; (right) £[62] estimation. 
The true value is represented as an horizontal dotted line. 



the mixture are located randomly in different directions slightly away from 0: 
the mean of the components are drawn from a bivariate Gaussian distribution 
with mean and covariance equal to So/5, where £0 is a diagonal matrix with 
diagonal entries (200, 50). In parallel, we run the nested sampling algorithm with 
N = 1000 initial points, reproducing the implementation of Skilling ( 2006 ) , using 
50 steps of a random walk in (61,82) constrained by the likelihood boundary to 
produce the next value, based on the contribution of the current value of {6\, 9 2 ) 
to the approximation of 3i- The step size (ie. the variance) in the random walk 
is 0.1 and the process is repeated (iterated) 10,000 times (and monitored) to 
ensure a definitive completion of the accumulation of 3i- Alternative scenarios, 
including changes to the number of points N, to the number of steps and to the 
step size were explored to assess the sensitivity of the results to the values set, 
but they did not lead to an improvement in the results. To assess the variability 
of the results, 100 simulation runs for both PMC and nested sampling algorithms 
have been used. 

Figure [4] summarises our results for PMC compared with nested sampling 
over the 100 simulation runs for evaluation of the evidence 3i and of the poste- 
rior mean for 9i (E[#i]) and 62 (K[8i\), since the outcome of a nested sampling 
run can be utilised as any importance sampling output. Those results suggest 
that nested sampling exhibits a slight upward bias for the evaluation of the evi- 
dence (a point also noted in Chopin and Robert ( 2007b[ )) while it approximately 
produces the same numerical value for the estimates of E[#i] and of £[#2], albeit 
with a greater variability. 



4 Comments 

Various importance sampling strategies have been proposed recently that ex- 
plicitly target the evidence. While there is no clear winner emerging from the 
comparison, we conclude that the bridge sampling strategy remains a reference 
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in this domain, but also that the harmonic version of 



(19941, 



Gelfand and Dey 



Bartolucci et al.| ( |2006[ ), |Chopin and Robert| ( |2007b I may produce valuable ap- 
proximations if the empirical HPD regions are exploited in the way described 
in the current paper. 
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